fix(bwrap): default to deny-by-default filesystem (mirror seatbelt) #482
caarlos0 wants to merge 10 commits into
The Bubblewrap backend used to bind-mount the entire host root read-only into every sandbox (`--ro-bind / /`), so the caller's $HOME, /root, /opt, /var, /sys, /run/user/<uid>, and everything else readable by the calling uid was visible inside the sandbox by default. The macOS Seatbelt backend, by contrast, starts from `(deny default)` and only allows a narrow system baseline -- bwrap now matches that posture.

The new baseline (`BASELINE_RO_BIND_PATHS`) mirrors seatbelt's `SYSTEM_READ_ALLOW` allowlist: top-level executable/library dirs (/bin, /sbin, /lib*), the /usr subpaths that seatbelt allows (without /usr/local), /etc, and the DNS stub-resolver directories under /run (/run/systemd/resolve, /run/NetworkManager, /run/resolvconf) so /etc/resolv.conf symlinks still resolve when network is allowed. $HOME, /opt, /usr/local, /var, /sys, and /run/user/<uid> are no longer visible until the caller opts in via `readonlyPaths` / `readwritePaths`.

Paths are emitted via `--ro-bind-try` so missing entries are silently skipped (e.g. /lib32 on x86_64-only systems, /run/systemd/resolve on hosts without systemd-resolved).

Files in /etc with restrictive perms (/etc/shadow, /etc/sudoers, /etc/ssh/ssh_host_*_key) remain unreadable to a non-root caller even though /etc is bound whole -- user-namespace UID mapping does not bypass kernel DAC.

Updated the existing `filesystem_policy_produces_correct_mounts` test and added 6 new tests covering the new contract (no host-root bind, required baseline paths emitted, /usr/local not exposed, confidential paths excluded, DNS dirs included, baseline precedes policy mounts). Docs in docs/bwrap-support/bubblewrap-backend.md updated accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Carlos Alexandro Becker <caarlos0@users.noreply.github.com>
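A minimal sketch of the ordering this commit message describes: baseline `--ro-bind-try` mounts first, caller policy mounts after. The function name `filesystem_args` and its signature are hypothetical, and the path list is assembled from the PR description rather than copied from `bwrap_command.rs`, so the real `build_args` may differ.

```rust
/// Assumed baseline, taken from the PR description; the real
/// `BASELINE_RO_BIND_PATHS` in `bwrap_command.rs` may differ.
const BASELINE_RO_BIND_PATHS: &[&str] = &[
    "/bin", "/sbin", "/lib", "/lib32", "/lib64", "/libx32",
    "/usr/bin", "/usr/sbin", "/usr/lib", "/usr/lib32", "/usr/lib64",
    "/usr/libexec", "/usr/share",
    "/etc",
    "/run/systemd/resolve", "/run/NetworkManager", "/run/resolvconf",
];

/// Hypothetical helper: builds only the filesystem part of a bwrap
/// argument vector. There is no `--ro-bind / /`; nothing outside the
/// baseline is visible unless the caller lists it.
fn filesystem_args(readonly_paths: &[&str], readwrite_paths: &[&str]) -> Vec<String> {
    let mut args: Vec<String> = Vec::new();
    // Baseline first. `--ro-bind-try` makes bwrap skip sources that do
    // not exist on this host instead of failing.
    for &path in BASELINE_RO_BIND_PATHS {
        args.extend(["--ro-bind-try", path, path].map(String::from));
    }
    // Policy mounts come later so they can shadow a baseline entry.
    for &path in readonly_paths {
        args.extend(["--ro-bind", path, path].map(String::from));
    }
    for &path in readwrite_paths {
        args.extend(["--bind", path, path].map(String::from));
    }
    args
}
```

Calling `filesystem_args(&["/data"], &[])` yields the baseline triples followed by `--ro-bind /data /data`, which is why a bare `--ro-bind` in the output can be attributed to policy.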
`nanvix_common` is a `[build-dependency]` of `lxc` and `wxc`. Build deps are compiled for the host, so cross-compiling lxc-exec from macOS to aarch64-unknown-linux-gnu pulled nanvix_common into a host build where `target_os` was neither "windows" nor "linux" -- the `REQUIRED_BINARIES` and `NANVIXD_BINARY` constants then had no definition and the crate failed to compile.

Add empty/zero fallbacks for non-Windows/Linux hosts. The empty slice is correct because:

- NanVix only runs on Windows and Linux, so iterating `REQUIRED_BINARIES` on other hosts must be a no-op.
- The consuming build scripts (e.g. `src/core/lxc/build.rs`) already gate the surrounding logic behind `cfg(target_os = "linux")` and `feature = "microvm"`, so the fallback values are never reached in practice.

Zero runtime impact on supported platforms; pure build-time portability fix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Carlos Alexandro Becker <caarlos0@users.noreply.github.com>
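The fallback pattern can be sketched with `cfg` attributes. The constant types and the binary names below are assumptions for illustration (the real definitions live in `src/backends/nanvix/common/src/lib.rs`), and `missing_binaries` is a hypothetical consumer showing why an empty slice makes the loop a no-op.

```rust
use std::path::Path;

// Assumed shape: one definition per supported OS, plus a fallback so the
// crate still compiles when built for any other host.
#[cfg(target_os = "linux")]
pub const REQUIRED_BINARIES: &[&str] = &["nanvixd"]; // illustrative name
#[cfg(target_os = "windows")]
pub const REQUIRED_BINARIES: &[&str] = &["nanvixd.exe"]; // illustrative name
#[cfg(not(any(target_os = "linux", target_os = "windows")))]
pub const REQUIRED_BINARIES: &[&str] = &[];

#[cfg(target_os = "linux")]
pub const NANVIXD_BINARY: &str = "nanvixd";
#[cfg(target_os = "windows")]
pub const NANVIXD_BINARY: &str = "nanvixd.exe";
// Empty fallback, as described in this commit message.
#[cfg(not(any(target_os = "linux", target_os = "windows")))]
pub const NANVIXD_BINARY: &str = "";

/// Hypothetical consumer: lists required binaries absent from `dir`.
/// With the empty fallback slice this always returns an empty Vec.
pub fn missing_binaries(dir: &Path) -> Vec<&'static str> {
    REQUIRED_BINARIES
        .iter()
        .copied()
        .filter(|name| !dir.join(name).exists())
        .collect()
}
```

Exactly one definition of each constant survives `cfg` evaluation on any host, so there is never a duplicate or a missing symbol.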
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR tightens the Bubblewrap backend’s default filesystem exposure by switching from a full host-root bind mount to a minimal allowlist baseline, adds regression tests for the new deny-by-default posture, and updates docs accordingly. It also adds NanVix constant fallbacks so the NanVix common crate can compile on non-Windows/Linux hosts when used as a build dependency.
Changes:
- Bubblewrap: replace `--ro-bind / /` with a minimal baseline set of `--ro-bind-try` mounts and add targeted regression tests.
- NanVix: add non-Windows/Linux fallbacks for `REQUIRED_BINARIES` and `NANVIXD_BINARY` to support host builds on macOS/BSD.
- Docs: document the Bubblewrap deny-by-default filesystem model and its consequences.
Show a summary per file
| File | Description |
|---|---|
| src/backends/nanvix/common/src/lib.rs | Adds non-Windows/Linux fallbacks for NanVix host-compiled constants to keep builds working when cross-compiling. |
| src/backends/bubblewrap/common/src/bwrap_command.rs | Introduces a minimal baseline allowlist (deny-by-default) via --ro-bind-try and expands/updates tests. |
| docs/bwrap-support/bubblewrap-backend.md | Documents the new baseline filesystem behavior and user-facing implications. |
Copilot's findings
- Files reviewed: 3/3 changed files
- Comments generated: 5
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
- nanvix: use a descriptive sentinel for NANVIXD_BINARY on unsupported hosts instead of an empty string, so any accidental Command use fails with a named program rather than an empty one.
- bwrap: soften "mirrors seatbelt's baseline exactly" comment to "aligned with" to avoid implying exact, lasting parity.
- bwrap test: drop the brittle `assert!(ro_pos > 0)`; the preceding `.expect(...)` already guarantees the mount exists.
- bwrap test: restrict the /usr/local check to mount-argument windows so a script body mentioning /usr/local cannot cause a false positive.
- docs: note the deny-by-default baseline requires bwrap 0.3.0+ for `--ro-bind-try`.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Carlos Alexandro Becker <caarlos0@users.noreply.github.com>
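One way to read "restrict the check to mount-argument windows" is sketched here; this is not the PR's actual test code. The helper `exposes` and the `MOUNT_FLAGS` list are hypothetical. The idea is to inspect only `flag, src, dest` triples that begin with a bind flag, so a command string that merely mentions a path is ignored.

```rust
use std::path::Path;

/// bwrap bind flags considered by this sketch.
const MOUNT_FLAGS: &[&str] = &["--bind", "--bind-try", "--ro-bind", "--ro-bind-try"];

/// Hypothetical check: true if some bind mount's destination is `path`
/// or one of its ancestors, which would make `path` visible in the sandbox.
fn exposes(args: &[&str], path: &str) -> bool {
    args.windows(3).any(|w| {
        // w = [flag, src, dest]; the comparison is per path component, so
        // "/usr/lib" is not treated as a prefix of "/usr/local".
        MOUNT_FLAGS.contains(&w[0]) && Path::new(path).starts_with(w[2])
    })
}
```

A plain substring search over the whole argument vector would flag `sh -c "ls /usr/local"` even though nothing is mounted there; the windowed check does not.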
…into bwrap-deny-default
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
The previous MXC-PR-Build run (149353501) failed only on the Linux 1ES agents with transient network errors in the same time window:

- SDK Unit Tests (linux): "The SSL connection could not be established"
- x64/arm64 LXC builds: cargo exited 101 (dependency fetch failure)

All equivalent Windows/macOS jobs passed, and the exact Linux build/test commands plus the SDK unit tests reproduce cleanly and pass locally, so the branch changes are not the cause. Empty commit to re-run CI.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Carlos Alexandro Becker <caarlos0@users.noreply.github.com>
/azp run
Commenter does not have sufficient privileges for PR 482 in repo microsoft/mxc
Signed-off-by: Carlos Alexandro Becker <caarlos0@users.noreply.github.com>
    // DNS stub-resolver directories. /etc/resolv.conf is usually a
    // symlink into one of these on modern Linux distros (systemd-resolved
    // / NetworkManager / resolvconf). We bind the narrow subdirectories
    // rather than all of /run to avoid exposing /run/user/<uid>.
    "/run/systemd/resolve",
    "/run/NetworkManager",
    "/run/resolvconf",
⚠️ Deny-by-default can silently break DNS when /etc/resolv.conf is a symlink to a path outside the baseline
Severity: Medium (the failure is silent and cryptic, which is what keeps it from Low)
Mechanism
/etc is bound whole, so a /etc/resolv.conf symlink is preserved as a symlink — bwrap doesn't dereference it. When a sandboxed process opens it, the kernel resolves the target inside the sandbox mount namespace, where only the baseline paths exist. Because /var and /mnt are never mounted, any resolv.conf target routed through them is a dangling link → glibc falls back to nameserver 127.0.0.1 (usually nothing listening) → name resolution fails with no obvious cause. The sandbox still starts successfully, so this surfaces as a confusing "DNS doesn't work" rather than a clear error.
A subtle trap: binding the canonical /run/NetworkManager here does not rescue a link written as /var/run/NetworkManager/resolv.conf, because the intermediate /var/run compat symlink lives in the unmounted /var.
What actually works vs. breaks
The common modern configs are fine (the baseline was clearly designed around them):
| Config | /etc/resolv.conf | Result |
|---|---|---|
| systemd-resolved (Ubuntu 18.04+, most cloud images) | → /run/systemd/resolve/stub-resolv.conf | ✅ works (/run/systemd/resolve bound) |
| resolvconf | → /run/resolvconf/resolv.conf | ✅ works (/run/resolvconf bound) |
| NetworkManager default mode | real file written into /etc | ✅ works (inside bound /etc) |
| NM symlink mode, canonical | → /run/NetworkManager/resolv.conf | ✅ works (/run/NetworkManager bound) |
| static /etc/resolv.conf | real file | ✅ works |
| WSL | → /mnt/wsl/resolv.conf | ❌ breaks (/mnt unmounted) |
| /var/run-routed (older RHEL/CentOS-era) | → /var/run/NetworkManager/resolv.conf | ❌ breaks (/var unmounted) |
| admin-custom target outside the 17 baseline paths | → e.g. /etc/dns/resolv.conf on a non-bound mount | ❌ breaks |
So this is a real but minority break, not the broad regression it might first appear — the dominant systemd-resolved / resolvconf / NM-default paths are all covered. Worth noting the PR's VM smoke test ("DNS working with network allowed") almost certainly ran on a systemd-resolved host, i.e. the case that does work, which is why the gap wasn't surfaced.
Suggested fix (cheap, and worth doing regardless):
- Preferred: at runtime, `readlink` `/etc/resolv.conf` and `--ro-bind-try` its canonical target before emitting the baseline.
- Or statically: add `--symlink run /var/run` (rescues the entire `/var/run/...` family without exposing `/var`'s contents) plus `/mnt/wsl/resolv.conf` to the baseline.
Either way, a regression test that points /etc/resolv.conf at a /var/run/... target and asserts the resolved target ends up reachable would lock this in.
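A rough sketch of the runtime option suggested above, for comparison only. `extra_resolv_bind` is a hypothetical helper, not code from this PR; it assumes a Unix host and that binding the fully canonicalized target is acceptable. It does not recreate intermediate symlinks such as `/var/run`, so a link routed through one would still dangle at its original path inside the sandbox.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: returns extra bwrap arguments that bind the fully
/// resolved target of `resolv_conf` when no already-bound directory covers it.
fn extra_resolv_bind(resolv_conf: &Path, bound: &[&str]) -> io::Result<Vec<String>> {
    // canonicalize follows every symlink on the host, so a plain file in
    // /etc simply resolves to itself.
    let target = fs::canonicalize(resolv_conf)?;
    if bound.iter().any(|dir| target.starts_with(dir)) {
        return Ok(Vec::new()); // already reachable through the baseline
    }
    let target = target.to_string_lossy().into_owned();
    // --ro-bind-try keeps sandbox startup working if the file disappears
    // between this check and the bwrap invocation.
    Ok(vec!["--ro-bind-try".to_string(), target.clone(), target])
}
```

The trade-off is that the argument list then depends on the host filesystem at call time, which makes the builder harder to unit-test on arbitrary hosts.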
Fixed in 9ae2486. Went with the static option since `build_args` is deliberately pure/platform-agnostic and unit-testable on every host:

- `/var/run/...`-routed targets: synthesise a `--symlink /run /var/run` compat symlink. `/var/run/NetworkManager/resolv.conf` (and the rest of the `/var/run/...` family) now resolve into the already-bound `/run/*` DNS dirs. bwrap synthesises an empty `/var` for the symlink, so no host `/var` contents are exposed.
- WSL: `--ro-bind-try /mnt/wsl/resolv.conf` (single file, skipped on non-WSL hosts, so no `/mnt` exposure).
Added two regression tests (baseline_recreates_var_run_compat_symlink, baseline_includes_wsl_resolv_conf) and updated the backend docs. Verified the symlink + bind resolution empirically against bwrap 0.8.0 (and confirmed the control case fails without the symlink).
Truly custom out-of-baseline targets (e.g. an admin-set /etc/dns/resolv.conf on an unbound mount) still need a readonlyPaths entry; that residual caveat is now documented.
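Why the compat symlink rescues `/var/run/...` targets can be shown with a toy model of path lookup inside the sandbox. This is an illustration, not bwrap's or the kernel's real resolution logic: `reachable` is a hypothetical function that applies at most one synthesized directory symlink and then checks whether the result falls under a bound path.

```rust
use std::path::{Path, PathBuf};

/// Toy model: `symlinks` holds (dest, target) pairs as in
/// `--symlink TARGET DEST`; `bound` holds paths bound into the sandbox.
fn reachable(path: &str, symlinks: &[(&str, &str)], bound: &[&str]) -> bool {
    let mut resolved = PathBuf::from(path);
    for (dest, target) in symlinks {
        // Rewrite the prefix if the path goes through a synthesized symlink.
        if let Ok(rest) = Path::new(path).strip_prefix(dest) {
            resolved = Path::new(target).join(rest);
            break;
        }
    }
    bound.iter().any(|b| resolved.starts_with(b))
}
```

With `bound = ["/etc", "/run/NetworkManager", "/mnt/wsl/resolv.conf"]`, the path `/var/run/NetworkManager/resolv.conf` is unreachable without the `("/var/run", "/run")` entry and reachable with it, while unrelated `/var` paths such as `/var/log/syslog` stay hidden either way.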
… baseline

The deny-by-default baseline never mounts /var or /mnt, so an /etc/resolv.conf symlink routed through /var/run/... (older RHEL/CentOS, some container images) or /mnt/wsl/resolv.conf (WSL) would dangle inside the sandbox and silently break name resolution.

Cover the two common out-of-baseline targets without exposing host /var or /mnt contents:

- synthesise a `/var/run -> /run` compat symlink so /var/run/...-routed resolv.conf targets resolve into the already-bound /run/* DNS dirs;
- `--ro-bind-try` /mnt/wsl/resolv.conf so WSL DNS works (skipped on non-WSL hosts).

Add regression tests for both and update the backend docs. Verified the symlink/bind behavior empirically with bwrap 0.8.0.

Addresses review feedback from @MGudgin on PR microsoft#482.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Carlos Alexandro Becker <caarlos0@users.noreply.github.com>
📖 Description
Change the Bubblewrap backend's default filesystem posture from "host root mounted read-only" to deny-by-default, matching the macOS Seatbelt backend's `(deny default)` baseline.

Why

`bwrap_command::build_args` used to emit `--ro-bind / /`. That bind-mounted the entire host root read-only into every sandbox, so the caller's `$HOME/.aws/credentials`, `$HOME/.ssh/id_*`, browser cookies, etc. were readable inside the sandbox by default. The Seatbelt backend on macOS starts from `(deny default)` and only allows narrow system paths (`SYSTEM_READ_ALLOW` in `src/backends/seatbelt/common/src/profile_builder.rs`), so the two backends had a meaningful asymmetry in the confidentiality guarantees they offered. This PR closes that gap.

What changes

New baseline (`BASELINE_RO_BIND_PATHS`), which mirrors seatbelt's `SYSTEM_READ_ALLOW`:

- `/bin`, `/sbin`, `/lib`, `/lib32`, `/lib64`, `/libx32` (symlinks under `/usr` on merged-usr distros; bwrap follows source-side symlinks so both real-dir and symlinked distros work).
- `/usr` subpaths: `/usr/bin`, `/usr/sbin`, `/usr/lib`, `/usr/lib32`, `/usr/lib64`, `/usr/libexec`, `/usr/share`. Deliberately not `/usr` wholesale, so `/usr/local` is not implicitly exposed.
- `/etc`: whole, like seatbelt's `/private/etc`. Files with restrictive perms (`/etc/shadow`, `/etc/sudoers`, `/etc/ssh/ssh_host_*_key`) stay unreadable to a non-root caller because user-namespace UID mapping does not bypass kernel DAC.
- `/run/systemd/resolve`, `/run/NetworkManager`, `/run/resolvconf`: needed when `/etc/resolv.conf` is a symlink. Narrow subpaths so `/run/user/<uid>` (D-Bus session, keyring, ssh-agent sockets) stays hidden.
- `/etc/resolv.conf` symlinks outside `/run`: also synthesise a `/var/run -> /run` compat symlink (for `/var/run/...`-routed targets, as on older RHEL/CentOS-era systems and some container images) and `--ro-bind-try` `/mnt/wsl/resolv.conf` (for WSL), so DNS keeps working under deny-by-default without exposing host `/var` or `/mnt` contents.

All emitted via `--ro-bind-try` so missing paths are silently skipped (e.g. `/lib32` on x86_64-only systems, `/run/systemd/resolve` on hosts without systemd-resolved).

What disappears from sandbox by default

`$HOME`, `/root`, `/home/*`, `/opt`, `/srv`, `/mnt`, `/media`, `/var`, `/sys`, `/usr/local`, `/run/user/<uid>`, `/run/dbus`. Callers who legitimately need any of these must list them under `readonlyPaths` or `readwritePaths`.

What's preserved

- `readwritePaths` / `readonlyPaths` / `deniedPaths` semantics: unchanged.
- `--unshare-*` flags, network policy handling, proxy env-var injection, working-dir, env clearing: unchanged.
- `--dev /dev` / `--proc /proc` / `--tmpfs /tmp` overlay: unchanged.

Drive-by build fix
The second commit (`fix(nanvix): compile as build-dep from non-Linux/Windows hosts`) adds empty/zero fallbacks for `REQUIRED_BINARIES` and `NANVIXD_BINARY` so `nanvix_common` compiles on macOS hosts when pulled in as a `[build-dependency]` of `lxc`/`wxc` during cross-compile. Zero runtime impact on supported platforms: the consuming build scripts already gate the surrounding logic behind `cfg(target_os = "linux"/"windows")` and `feature = "microvm"`. Separated out so it can be reviewed (or split into its own PR) independently.

Breaking change for users

This is a behavior change. Configs that implicitly relied on `$HOME` (or `/opt`, `/var`, `/usr/local`, …) being readable will start failing. The migration is to list the directory in `readonlyPaths`:

`{ "filesystem": { "readonlyPaths": ["/home/alice/project", "/usr/local"] } }`

Documented in the updated "How It Works → Deny-by-default filesystem" and "Limitations" sections of `docs/bwrap-support/bubblewrap-backend.md`.

🔗 References
No tracking issue — this came out of a direct comparison between the seatbelt and bwrap baselines while reviewing the two unprivileged backends.
Related follow-up (out of scope for this PR):
🔍 Validation
Unit tests (`cargo test -p bwrap_common` from `src/`): 25/25 pass, including new regression tests covering the new contract:

- `baseline_does_not_bind_mount_host_root`: regression test for the old `--ro-bind / /` default.
- `baseline_emits_required_ro_bind_try_paths`: `/bin`, `/sbin`, `/lib`, `/lib64`, `/usr/bin`, `/usr/lib`, `/usr/share`, `/etc` all emitted.
- `baseline_does_not_expose_usr_local`: no `--ro-bind /usr /usr` and no explicit `/usr/local` entry.
- `baseline_excludes_confidential_paths`: no `/home`, `/root`, `/opt`, `/srv`, `/var`, `/sys`, `/run/user`, `/run/dbus` bind-mounts.
- `baseline_includes_dns_stub_resolver_dirs`: all three DNS dirs emitted via `--ro-bind-try`.
- `baseline_mounts_precede_policy_mounts`: policy mounts can still shadow baseline.
- `baseline_recreates_var_run_compat_symlink`: emits `--symlink /run /var/run` (and never binds host `/var`) so `/var/run/...`-routed `resolv.conf` symlinks resolve.
- `baseline_includes_wsl_resolv_conf`: emits `--ro-bind-try /mnt/wsl/resolv.conf` (and never exposes `/mnt` wholesale) so WSL DNS works.

Plus updated `filesystem_policy_produces_correct_mounts` to match the new contract (a bare `--ro-bind /data /data` is now unambiguously the policy mount).

Lint / format: `cargo clippy -p bwrap_common --all-targets -- -D warnings` clean, `cargo fmt --all -- --check` clean.

Bubblewrap behavior: verified empirically against bwrap 0.8.0 that the `/var/run -> /run` symlink makes `/var/run/NetworkManager/resolv.conf` resolve into the bound `/run/NetworkManager`, that a WSL-style `/etc/resolv.conf -> /mnt/wsl/resolv.conf` is readable, and that the control case (no symlink) fails, reproducing the original gap.

Linux VM verification: cross-compiled `lxc-exec` for `aarch64-unknown-linux-gnu` and ran a 6-config smoke suite on a Linux VM (see `src/target/vm-test-bundle/` locally; gitignored). The suite plants `TOP_SECRET=hunter2` in `/home/SENTINEL_DO_NOT_LEAK.txt` on the host and verifies the secret does not appear in sandbox output without an explicit `readonlyPaths: ["/home"]`, then verifies the opt-in does expose it. Also covers `/opt` / `/var` / `/sys` / `/root` / `/usr/local` being hidden, DNS resolution working with network allowed, and `/etc/shadow` staying unreadable via DAC. (Will paste the run output as a PR comment once the VM run is complete.)

✅ Checklist
📋 Issue Type